The aim of this project is to investigate how the performance of elite runners is linked to factors like height, weight and age. This will be done by analysing historical performance data, mainly from modern Olympic running events. Understanding the relationship between these characteristics and performance could help in developing training plans and improving athlete performance in the future.
Athlete performance is clearly affected by many factors, and this analysis will be limited to just a few of them, dictated mainly by the data available. The specific questions addressed here are:
There will be no particular modelling or machine learning in this analysis because the questions can be answered by visualising the statistics alone.
# Import libraries
import pandas as pd
import chardet # For character encoding
import ftfy # For fixing encoding issues
from matplotlib import pyplot as plt
from matplotlib import pylab as plb
from datetime import datetime, time, timedelta, date
import numpy as np
from fuzzywuzzy import fuzz # For inexact ("fuzzy") string matching
from fuzzywuzzy import process
from pandas.plotting import register_matplotlib_converters
register_matplotlib_converters() # For future compatibility when plotting with datetime
This analysis will attempt some originality by combining three separate data sets. This allows athlete characteristics to be linked to athlete performances so any relationship between the two can be investigated.
First, load the data sets and briefly examine them.
The first data set is the Olympic Games results and athlete data, 1896-2016.
Source:
# Results data is in the first file:
all_olympics = pd.read_csv('datasets/athlete_events.csv')
all_olympics.head()
all_olympics.info()
Summary
The full Olympic Games data set contains useful information about the competitors over a 120-year period (height, weight, age, country of origin, and medal, if they won one).
The second data set contains the Olympic track and field times and results. Note it only includes data for medal winners. Source:
# Data set 2 - Olympic track and field times and results. Source:
# https://www.kaggle.com/jayrav13/olympic-track-field-results/downloads/olympic-track-field-results.zip/1
# There is an additional column in a few of the rows. This is unlabelled so not useful in this analysis.
# Therefore, read explicitly labelled columns and discard the unlabelled column.
ol_tf = pd.read_csv('datasets/results.csv', names=['Gender',
'Event',
'Location',
'Year',
'Medal',
'Name',
'Nationality',
'Result'])
ol_tf.drop(index=0, inplace=True)
ol_tf.head()
ol_tf.info()
Summary
The most useful feature of the track and field results is the detailed running times and event results. This will be linked to the full Olympic data (including its information on the athletes' characteristics) later in the analysis.
The third data set contains the top 1000 running performances for each running event.
Source:
https://www.kaggle.com/jguerreiro/running/downloads/running.zip/2
top_running = pd.read_csv('datasets/data.csv')
top_running.head()
top_running.info()
Summary
This data set is useful because it contains a large number of data points (the top 1000 performances per event), including finish times for every running discipline. It is not limited to Olympic performances, but all the events are Olympic distances, with the exception of the half marathon.
print("Number of unique events is {}".format(len(all_olympics['Event'].unique())))
765 events is far too many to analyse. It also includes some events which have not taken place in the Olympics for a long time. This analysis is focussed on modern running events, so we will extract a subset of the results.
olympic_sports_groups = all_olympics.groupby('Sport')
athletics = olympic_sports_groups.get_group('Athletics')
all_athletics_events = athletics['Event'].unique()
all_athletics_events
This is a more manageable list of events. There are still some events here that don't exist in the modern Games. The next step is to remove any events that didn't take place in the most recent summer Games (2016).
modern_athletics_events = athletics[athletics['Year']==2016]['Event'].unique()
modern_athletics_events
removed_events = set(all_athletics_events).difference(modern_athletics_events)
removed_events
# Keep only the rows whose event has not been removed.
modern_athletics = athletics[~athletics['Event'].isin(removed_events)].copy()
modern_athletics['Event'].unique()
This analysis will focus on individual running events. So, now remove the field events and non-running events.
# These are the events to keep for the analysis.
modern_individual_running_events = {"Athletics Women's 100 metres",
"Athletics Men's 1,500 metres",
"Athletics Men's 5,000 metres",
"Athletics Men's 110 metres Hurdles",
"Athletics Women's Marathon",
"Athletics Men's 100 metres",
"Athletics Men's 400 metres Hurdles",
"Athletics Men's 400 metres",
"Athletics Men's 800 metres",
"Athletics Men's Marathon",
"Athletics Men's 10,000 metres",
"Athletics Men's 200 metres",
"Athletics Men's 3,000 metres Steeplechase",
"Athletics Women's 200 metres",
"Athletics Women's 5,000 metres",
"Athletics Women's 10,000 metres",
"Athletics Women's 1,500 metres",
"Athletics Women's 800 metres",
"Athletics Women's 400 metres",
"Athletics Women's 400 metres Hurdles",
"Athletics Women's 100 metres Hurdles",
"Athletics Women's 3,000 metres Steeplechase"}
removed_events = set(modern_athletics_events).difference(modern_individual_running_events)
removed_events
# Keep only the rows whose event has not been removed.
ol_running = modern_athletics[~modern_athletics['Event'].isin(removed_events)].copy()
ol_running.head()
# Check for missing values in each column.
ol_running.isnull().sum()
Many rows have no entry for a medal. This is expected, since many competitors do not win a medal, so no special treatment is needed for missing values in the Medal feature. There are also a lot of missing values for height, weight and age. These will be examined now.
age_missing = ol_running[ol_running['Age'].isnull()]
weight_missing = ol_running[ol_running['Weight'].isnull()]
height_missing = ol_running[ol_running['Height'].isnull()]
age_missing.head()
weight_missing.head()
height_missing.head()
We now have three groups of rows that have at least one missing value. Now find out if they overlap by using sets:
age_missing_indices = set(age_missing.index)
weight_missing_indices = set(weight_missing.index)
height_missing_indices= set(height_missing.index)
print("The number of rows where both height and weight are missing is {}".format(
len(weight_missing_indices.intersection(height_missing_indices))))
print("The number of rows where both age and weight are missing is {}".format(
len(age_missing_indices.intersection(weight_missing_indices))))
print("The number of rows where both age and height are missing is {}".format(
len(age_missing_indices.intersection(height_missing_indices))))
print("The number of rows where age, height and weight are missing is {}".format(
len(age_missing_indices.intersection(height_missing_indices,
weight_missing_indices))))
Of the rows where either height (2987) or weight (3131) are missing, most (2961) of them are missing both height and weight. Of the rows where age is missing (667), most (at least 504) are also missing either weight, height or both. The overlap between the missing data sets is large, which is good news, because it means more of the rows are fully populated, so more of this data is usable without dropping data or imputation. For now, all the data will be kept (not dropping rows with missing data).
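As a rough cross-check on the set-based counts above, the same overlaps can be computed with boolean masks. The sketch below uses a small made-up data frame (`toy`) rather than the Olympic data, so the numbers are purely illustrative.

```python
import numpy as np
import pandas as pd

# Toy data frame; the values are made up for illustration only.
toy = pd.DataFrame({
    'Age':    [24, np.nan, 30, np.nan],
    'Height': [np.nan, np.nan, 180, 175],
    'Weight': [np.nan, np.nan, 70, 68],
})

# Boolean mask of missing entries in the three columns of interest.
missing = toy[['Age', 'Height', 'Weight']].isnull()

# Rows where both height and weight are missing.
both_hw = int((missing['Height'] & missing['Weight']).sum())

# Rows where age, height and weight are all missing.
all_three = int(missing.all(axis=1).sum())

# Rows with no missing value in any of the three columns.
n_complete = int((~missing.any(axis=1)).sum())

print(both_hw, all_three, n_complete)
```

Working with a single mask keeps all the counts in one place, which may be easier to extend if more columns need checking.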
The Event feature is a categorical variable. This will be encoded as follows:
This method of encoding is chosen because it groups together similar types of events (e.g., hurdles events are treated as a group, flat track events are treated as a separate group) and also separates them by the distance of each event (100m, 200m, etc.)
# Simple string processing in Event column
ol_running['Event'] = ol_running['Event'].str.replace("Athletics Women's ", "")
ol_running['Event'] = ol_running['Event'].str.replace("Athletics Men's ", "")
ol_running['Event'] = ol_running['Event'].str.replace(" metres", "")
ol_running['Event'] = ol_running['Event'].str.replace(",", "")
ol_running.head()
Adding the new columns, copying the values between columns and removing duplicates is repetitive, so write a function for this:
def encode_events(df, col, to_replace, replacement):
    """
    Helper function to insert new columns, copy and convert values to the correct column
    """
    # Insert new column
    df.insert(df.columns.get_loc('Event'), col, 0)
    # Copy values across to new column
    df.loc[df['Event'].str.contains(to_replace), col] = df['Event'].str.replace(to_replace,
                                                                                replacement)
    # Remove values from original column
    df.loc[df[col] != 0, 'Event'] = '0'

def string_to_int(df, features):
    """
    Helper function to cast string values to integers.
    """
    for feature in features:
        df[feature] = pd.to_numeric(df[feature], downcast='integer')
new_columns = ['Hurdles', 'Road', 'Steeplechase']
to_replace = [' Hurdles', 'Marathon', ' Steeplechase']
replacement = ['', '42195', '']
for i in range(len(new_columns)):
    encode_events(ol_running, new_columns[i], to_replace[i], replacement[i])
ol_running.rename(columns={'Event': 'Track_Flat'}, inplace=True)
# Several features now contain strings that would be easier to use as integers.
# Convert these to integers now.
columns_to_int = ['Track_Flat', 'Hurdles', 'Steeplechase', 'Road', 'Year']
string_to_int(ol_running, columns_to_int)
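To make the result of this encoding concrete, here is a minimal stand-alone sketch. The helper `encode_event` is hypothetical and not part of the notebook; it only mirrors the outcome of the steps above for a handful of illustrative event labels: one column per event type, holding the distance in metres for that type and 0 everywhere else.

```python
# Event labels as they look after the simple string processing step
# (illustrative subset).
events = ['100', '110 Hurdles', 'Marathon', '3000 Steeplechase']

def encode_event(label):
    """Return one encoded row: the distance in metres under the matching
    event type and 0 elsewhere. Hypothetical helper for illustration."""
    row = {'Track_Flat': 0, 'Hurdles': 0, 'Road': 0, 'Steeplechase': 0}
    if label == 'Marathon':
        row['Road'] = 42195
    elif label.endswith(' Hurdles'):
        row['Hurdles'] = int(label.split()[0])
    elif label.endswith(' Steeplechase'):
        row['Steeplechase'] = int(label.split()[0])
    else:
        row['Track_Flat'] = int(label)
    return row

encoded = [encode_event(e) for e in events]
for label, row in zip(events, encoded):
    print(label, row)
```

Each row ends up with exactly one non-zero entry, so the event type and the distance can both be read from the position and value of that entry.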
The other two data sets label the athlete's sex as 'Gender'. For ease of comparison, rename this feature from 'Sex' to 'Gender'.
ol_running.rename(columns={'Sex': 'Gender'}, inplace=True)
For ease of comparison with the other data sets, convert 'Gold' to 'G', 'Silver' to 'S', and 'Bronze' to 'B'.
medals = ['Gold', 'Silver', 'Bronze']
short_medals = ['G', 'S', 'B']
# Map each full medal name to its short form; missing values are left untouched.
ol_running['Medal'] = ol_running['Medal'].replace(dict(zip(medals, short_medals)))
The 'Name' column also looks difficult to use:
ol_running['Name'].head()
There are alternative names/nicknames in parentheses and double quotes. The intention is to use the names later on, so to make this easier, remove sections in parentheses and double quotes, and convert the name string to lowercase. Make this a function so it can be used on the other data sets later on.
def process_names(df):
    """
    Helper function to perform some cleaning on the athlete Name field.
    """
    df.rename(columns={'Name': 'RawName'}, inplace=True)
    df.insert(loc=df.columns.get_loc('RawName'), column='Name', value=np.nan)
    # regex=True is needed from pandas 2.0, where str.replace matches literally by default
    df['Name'] = df['RawName'].str.replace(r'"(.*?)"', '', regex=True)
    df['Name'] = df['Name'].str.replace(r'\((.*?)\)', '', regex=True)
    df['Name'] = df['Name'].str.lower()
process_names(ol_running)
ol_running.head()
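The effect of these cleaning steps on a single string can be sketched with the standard `re` module. The function `clean_name` and the sample name are hypothetical. Collapsing the leftover whitespace is an extra step that `process_names` does not perform.

```python
import re

def clean_name(raw):
    """Strip quoted nicknames and parenthesised alternatives, then lowercase.
    Hypothetical stand-alone helper for illustration."""
    name = re.sub(r'"(.*?)"', '', raw)      # drop "nickname" sections
    name = re.sub(r'\((.*?)\)', '', name)   # drop (alternative name) sections
    name = re.sub(r'\s+', ' ', name)        # extra step: collapse leftover gaps
    return name.strip().lower()

# Made-up name containing both a nickname and an alternative name.
cleaned = clean_name('Jane "Jay" Doe (Smith)')
print(cleaned)
```

The non-greedy `.*?` matters when a name contains more than one quoted or parenthesised section, because a greedy match would also remove everything between them.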
Wrangling of this data set is complete, and from here on the cleaned data frame will always be called ol_running.
print("Number of unique events is {}".format(len(ol_tf['Event'].unique())))
ol_tf.head()
all_ol_tf_events = ol_tf['Event'].unique()
all_ol_tf_events
As in the previous section, this analysis will keep the individual running events and drop the remainder.
ol_tf_running_events = {'10000M Men', '100M Men', '110M Hurdles Men', '1500M Men',
'200M Men', '3000M Steeplechase Men',
'400M Hurdles Men', '400M Men', '5000M Men',
'800M Men', 'Marathon Men', '10000M Women', '100M Hurdles Women',
'100M Women', '1500M Women', '200M Women',
'3000M Steeplechase Women', '400M Hurdles Women', '400M Women',
'5000M Women', '800M Women', 'Marathon Women'}
# Keep only the rows for individual running events.
ol_tf_running = ol_tf[ol_tf['Event'].isin(ol_tf_running_events)].copy()
ol_tf_running.head()
ol_tf_running['Event'].unique()
In addition, this data set contains results for a Men's 3000m steeplechase in 1900 and 1904. However, this is an error in the data: the 1900 and 1904 Olympics featured shorter steeplechase distances (source: https://en.wikipedia.org/wiki/Steeplechase_(athletics)). Therefore the rows for 3000M Steeplechase Men for 1900 and 1904 will be removed.
drop_steeplechase = ol_tf_running[((ol_tf_running['Year'] == '1900') |
(ol_tf_running['Year'] == '1904')) & (ol_tf_running['Event']
== '3000M Steeplechase Men')].index.tolist()
ol_tf_running.drop(index=drop_steeplechase, inplace=True)
This now contains the data of interest.
# Check for missing values in each column.
ol_tf_running.isnull().sum()
No missing values are shown, but this is deceptive, since some of the 'Result' fields contain the string 'None'.
ol_tf_running[ol_tf_running['Result'] == 'None'].head()
ol_tf_running.loc[ol_tf_running['Result'] == 'None', 'Result'] = pd.NaT
ol_tf_running.dropna(subset=['Result'], inplace=True)
The same approach will be used as in the previous section so that the data sets end up with a consistent set of labels for each event.
# Simple string processing in Event column
ol_tf_running['Event'] = ol_tf_running['Event'].str.replace("Women", "")
ol_tf_running['Event'] = ol_tf_running['Event'].str.replace("Men", "")
ol_tf_running['Event'] = ol_tf_running['Event'].str.replace("M ", "")
ol_tf_running['Event'] = ol_tf_running['Event'].str.replace(",", "")
new_columns = ['Hurdles', 'Road', 'Steeplechase']
to_replace = ['Hurdles', 'Marathon', 'Steeplechase']
replacement = ['', '42195', '']
for i in range(len(new_columns)):
    encode_events(ol_tf_running, new_columns[i], to_replace[i], replacement[i])
ol_tf_running.rename(columns={'Event': 'Track_Flat'}, inplace=True)
The aim is to convert the Results string to a datetime object, extract the time from this and store it in a feature called 'Time'. The time formats vary a lot in this data set so some cleaning is needed.
It's possible to create general groups of events that share similar formats.
# Hurdle events
ol_tf_running_hurdles_groups = ol_tf_running.groupby('Hurdles')
# Road running events
ol_tf_running_road_groups = ol_tf_running.groupby('Road')
# Steeplechase
ol_tf_running_steeplechase_groups = ol_tf_running.groupby('Steeplechase')
# Track (flat) events
ol_tf_running_trackf_groups = ol_tf_running.groupby('Track_Flat')
event_groups = [ol_tf_running_hurdles_groups,
ol_tf_running_road_groups,
ol_tf_running_steeplechase_groups,
ol_tf_running_trackf_groups]
for group in event_groups:
    for event in list(group.groups.keys())[1:]:  # Ignore the first event in each category where distance=0
        print("Event: {}".format(event))
        print(group.get_group(event)['Result'].head(3))
This shows it's possible to define three time formats in this result set:
# Time format for the sprint events
time_format_sprints = '%S.%f'
# Time format for middle distance events
time_format_middle = '%M:%S.%f'
# Time format for long distance events
time_format_long = '%H:%M:%S'
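As a quick check of how these three format strings behave, the sketch below parses one made-up result string per format with the standard library's `datetime.strptime` and keeps only the time part. The sample strings are illustrative and not taken from the data set.

```python
from datetime import datetime, time

time_format_sprints = '%S.%f'
time_format_middle = '%M:%S.%f'
time_format_long = '%H:%M:%S'

# One made-up result string per format.
sprint = datetime.strptime('9.81', time_format_sprints).time()
middle = datetime.strptime('3:50.00', time_format_middle).time()
marathon = datetime.strptime('2:08:44', time_format_long).time()

print(sprint, middle, marathon)
```

Note that `%f` pads on the right, so '9.81' is read as 9 seconds and 810000 microseconds.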
Examining each event in more detail shows that some further processing is needed.
Steeplechase
ol_tf_running_steeplechase_groups.get_group('3000 ')['Result'].head()
# Convert to datetime and extract the time part only.
ol_tf_running.loc[ol_tf_running['Steeplechase'] == '3000 ', 'Time'] = pd.to_datetime(
ol_tf_running[ol_tf_running['Steeplechase'] == '3000 ']['Result'],
format=time_format_middle).apply(datetime.time)
Hurdles
ol_tf_running_hurdles_groups.get_group('100 ')['Result'].head()
ol_tf_running_hurdles_groups.get_group('110 ')['Result'].head()
ol_tf_running_hurdles_groups.get_group('400 ')['Result'].head()
In addition, some of the time strings have a leading '0:':
ol_tf_running[ol_tf_running['Result'] == '0:54.0']
# Remove leading '0:':
ol_tf_running.loc[ol_tf_running['Hurdles'] == '400 ', 'Result'] = ol_tf_running[
ol_tf_running['Hurdles'] == '400 ']['Result'].str.replace('0:', '')
# For all the Hurdles distances - convert to datetime and extract the time part only.
events = list(ol_tf_running_hurdles_groups.groups.keys())
events.remove(0)  # Ignore the first event in each category where distance=0
for event in events:
    ol_tf_running.loc[ol_tf_running['Hurdles'] == event, 'Time'] = pd.to_datetime(
        ol_tf_running[ol_tf_running['Hurdles'] == event]['Result'],
        format=time_format_sprints).apply(datetime.time)
Track (Flat)
ol_tf_running_trackf_groups.get_group('100')['Result'].head()
ol_tf_running_trackf_groups.get_group('200')['Result'].head()
ol_tf_running_trackf_groups.get_group('400')['Result'].head()
ol_tf_running_trackf_groups.get_group('800')['Result'].head()
ol_tf_running_trackf_groups.get_group('1500')['Result'].head()
ol_tf_running_trackf_groups.get_group('5000')['Result'].head()
ol_tf_running_trackf_groups.get_group('10000')['Result'].head()
Track events for distances less than 800m all have times written in the format defined in time_format_sprints. 800m and above use the format defined in time_format_middle.
sprint_distances = ['100', '200', '400']
middle_distances = ['800', '1500', '5000', '10000']
As with the hurdles distances above, remove any leading '0:':
# Remove leading '0:':
for event in sprint_distances:
    ol_tf_running.loc[ol_tf_running['Track_Flat'] == event, 'Result'] = ol_tf_running[
        ol_tf_running['Track_Flat'] == event]['Result'].str.replace('0:', '')
# For the track sprint events - convert to datetime and extract the time part only.
for event in sprint_distances:
    ol_tf_running.loc[ol_tf_running['Track_Flat'] == event, 'Time'] = pd.to_datetime(
        ol_tf_running[ol_tf_running['Track_Flat'] == event]['Result'],
        format=time_format_sprints).apply(datetime.time)
# For the track middle distance events - convert to datetime and extract the time part only.
for event in middle_distances:
    ol_tf_running.loc[ol_tf_running['Track_Flat'] == event, 'Time'] = pd.to_datetime(
        ol_tf_running[ol_tf_running['Track_Flat'] == event]['Result'],
        format=time_format_middle).apply(datetime.time)
Road
ol_tf_running_road_groups.get_group('42195 ').head()
Some specific examples show there are several problems:
ol_tf_running_road_groups.get_group('42195 ').loc[[1379]]
ol_tf_running_road_groups.get_group('42195 ').loc[[1392]]
ol_tf_running_road_groups.get_group('42195 ').loc[[1417]]
This shows several formatting problems:
Taking these in turn:
# Replace 'h' with ':'
ol_tf_running.loc[ol_tf_running['Road'] == '42195 ', 'Result'] = ol_tf_running[
    ol_tf_running['Road'] == '42195 ']['Result'].str.replace('h', ':')
# Remove fractional seconds (regex=True is needed from pandas 2.0 onwards):
ol_tf_running.loc[ol_tf_running['Road'] == '42195 ', 'Result'] = ol_tf_running[
    ol_tf_running['Road'] == '42195 ']['Result'].str.replace(r'\..*', '', regex=True)
# Replace '-' with ':'
ol_tf_running.loc[ol_tf_running['Road'] == '42195 ', 'Result'] = ol_tf_running[
ol_tf_running['Road'] == '42195 ']['Result'].str.replace('-', ':')
ol_tf_running[ol_tf_running['Road'] == '42195 ']['Result'].head()
There are also some values that only include hours and minutes:
ol_tf_running[ol_tf_running['Result'] == '2:32']
for i in ol_tf_running[ol_tf_running['Road'] == '42195 '].index:
    if len(ol_tf_running.loc[i, 'Result'].split(':')) < 3:
        # Append zero seconds; using .loc directly avoids chained assignment
        ol_tf_running.loc[i, 'Result'] = ol_tf_running.loc[i, 'Result'] + ':00'
# Convert to datetime and extract the time part only.
ol_tf_running.loc[ol_tf_running['Road'] == '42195 ', 'Time'] = pd.to_datetime(
ol_tf_running[ol_tf_running['Road'] == '42195 ']['Result'],
format=time_format_long).apply(datetime.time)
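The marathon clean-up steps above can be summarised in a single stand-alone function. This is only a sketch: `normalise_marathon` is a hypothetical helper, and apart from '2:32' the sample strings are made up to represent each kind of formatting problem.

```python
import re

def normalise_marathon(result):
    """Bring a marathon result string into H:MM:SS form.
    Hypothetical helper mirroring the steps applied to the data frame."""
    result = result.replace('h', ':')        # '2h32:35' -> '2:32:35'
    result = re.sub(r'\..*', '', result)     # drop fractional seconds
    result = result.replace('-', ':')        # '2-32-35' -> '2:32:35'
    if len(result.split(':')) < 3:           # hours and minutes only
        result += ':00'
    return result

examples = ['2h32:35', '2-32-35.4', '2:32']
normalised = [normalise_marathon(r) for r in examples]
print(normalised)
```

After this step every string has the same three-part shape, so one format string is enough for the conversion to a time.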
# Several features now contain strings that would be easier to use as integers.
# Convert these to integers now.
columns_to_int = ['Track_Flat', 'Hurdles', 'Steeplechase', 'Road', 'Year']
string_to_int(ol_tf_running, columns_to_int)
Some names include nicknames in double quotes. There is also a string encoding problem causing some characters to be displayed wrongly. For example, 'Emil ZÃTOPEK' and 'Katrin DÃRRE' below:
ol_tf_running['Name'].loc[25]
ol_tf_running['Name'].loc[2322]
# Check encoding of the file
with open("datasets/results.csv", 'rb') as file:
    print(chardet.detect(file.read()))
chardet suggests the file is utf-8 encoded, so the problem is not simply that the file is being read with the wrong encoding. Instead, try to clean this up using the ftfy package, which fixes badly encoded text (reference for ftfy: https://ftfy.readthedocs.io/en/latest/).
ol_tf_running['Name'] = ol_tf_running['Name'].apply(ftfy.fix_encoding)
ol_tf_running['Name'].loc[25]
ol_tf_running['Name'].loc[2322]
This shows the bad encodings have disappeared.
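To illustrate the kind of error involved, the sketch below reproduces similar garbling using only built-in string methods. It assumes the text was encoded as UTF-8 and then wrongly decoded as Latin-1 at some earlier stage. That is a guess at the cause, not something the data set confirms, and ftfy itself handles many more cases than this.

```python
original = 'ZÁTOPEK'

# Encode as UTF-8, then (wrongly) decode the bytes as Latin-1.
garbled = original.encode('utf-8').decode('latin-1')

# Reversing the wrong step recovers the original text.
repaired = garbled.encode('latin-1').decode('utf-8')

print(repr(garbled), repaired)
```

The accented letter becomes two characters because its UTF-8 form is two bytes, each of which Latin-1 treats as a separate character.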
Other name text processing is the same as the previous section
# Use the processing function defined previously
process_names(ol_tf_running)
ol_tf_running['Name'].head()
Female athletes are categorised as 'W' in the 'Gender' column. Change this to be 'F' for consistency with the other data sets.
ol_tf_running['Gender'] = ol_tf_running['Gender'].str.replace('W', 'F')
This concludes cleaning of the second data set, which will be named ol_tf_running from here on.
Select individual running events as with the previous two data sets.
print("Number of unique events is {}".format(len(top_running['Event'].unique())))
top_running['Event'].unique()
These are all valid events for this analysis. No need to remove any.
top_running.isnull().sum()
There are a few missing 'Place' values. This analysis does not use this feature, so no further action is needed for now.
top_running.head()
The same approach will be used as in the previous section so that the data sets end up with a consistent set of labels for each event.
# Simple string processing in Event column
# Replace the race type strings ('Marathon', 'Half marathon') with their distance in metres:
racetype = ['Marathon', 'Half marathon']
distance = ['42195 Road', '21098 Road']
for i in range(len(racetype)):
    top_running.loc[top_running['Event'] == racetype[i], 'Event'] = top_running[
        top_running['Event'] == racetype[i]]['Event'].str.replace(racetype[i], distance[i])
top_running['Event'] = top_running['Event'].str.replace(",", "")
new_columns = ['Road']
to_replace = [' Road']
replacement = ['']
for i in range(len(new_columns)):
    encode_events(top_running, new_columns[i], to_replace[i], replacement[i])
top_running['Event'] = top_running['Event'].str.replace(" m", "")
top_running.rename(columns={'Event': 'Track_Flat'}, inplace=True)
top_running.head()
# In the 'Date' column, the year will be used as one of the keys to merge the data sets.
# Therefore, create a separate 'Year' column and populate it.
top_running.insert(top_running.columns.get_loc('Date'), 'Year', 0)
top_running['Year'] = top_running['Date'].str.split("-", expand=True)[0]
#top_running.rename(columns={'Date': 'Year'}, inplace=True)
# Several features now contain strings that would be easier to use as integers.
# Convert these to integers now.
columns_to_int = ['Track_Flat', 'Road', 'Year']
string_to_int(top_running, columns_to_int)
Convert the string into a datetime object. First look at what the different time formats used in each event are:
# Road running events
top_running_road_groups = top_running.groupby('Road')
# Track (flat) events
top_running_trackf_groups = top_running.groupby('Track_Flat')
event_groups = [top_running_road_groups,
top_running_trackf_groups]
for group in event_groups:
    for event in list(group.groups.keys())[1:]:  # Ignore the first event in each category where distance=0
        print("Event: {}".format(event))
        print(group.get_group(event)['Time'].head(3))
For the road running events, the time format is the same as already defined in time_format_long above. For the track events, most of the times have the same format ('%H:%M:%S.%f'), but there are occasional cases where the fractional seconds field is missing. For these cases it's possible to use the infer_datetime_format feature of pandas.to_datetime().
# Convert strings in the Time column to datetime objects for the road running events.
# Convert to datetime and extract the time part only.
events = top_running['Road'].unique().tolist()
events.remove(0)
for event in events:
    top_running.loc[top_running['Road'] == event, 'Time'] = pd.to_datetime(
        top_running[top_running['Road'] == event]['Time'],
        format=time_format_long).apply(datetime.time)
# Convert strings in the Time column to datetime objects for the track running events.
# Convert to datetime and extract the time part only.
events = top_running['Track_Flat'].unique().tolist()
events.remove(0)
for event in events:
    top_running.loc[top_running['Track_Flat'] == event, 'Time'] = pd.to_datetime(
        top_running[top_running['Track_Flat'] == event]['Time'],
        infer_datetime_format=True).apply(datetime.time)
Looking at the names in this data set - they seem straightforward:
top_running['Name'].head()
Other name text processing is the same as the previous section
# Use the processing function defined previously
process_names(top_running)
To be consistent with the other data sets, change the possible values of the 'Gender' feature to be either 'M' or 'F' instead of 'Men' or 'Women'.
# Vectorised mapping: 'Men' becomes 'M', anything else becomes 'F'.
top_running['Gender'] = np.where(top_running['Gender'] == 'Men', 'M', 'F')
top_running.head()
Later in this analysis, times from this data set will be merged into the Olympic data set. To facilitate this, it is necessary to label which rows correspond to an Olympic Games. This will be done by comparing the 'Date' field of the result to the known dates of the Olympic Games.
# Convert dates to datetime format
top_running['Date'] = pd.to_datetime(top_running['Date'], infer_datetime_format=True)
# What's the earliest year in the top_running data set?
min(top_running['Year'].tolist())
So there is no need to look at years before 1962.
# List of dates of Olympic summer games
# Source: https://en.wikipedia.org/wiki/Summer_Olympic_Games
# Use format Year-month-day
olympic_dates = [
['1964-10-10', '1964-10-24'],
['1968-10-12', '1968-10-27'],
['1972-08-26', '1972-09-10'],
['1976-07-17', '1976-08-01'],
['1980-07-19', '1980-08-03'],
['1984-07-28', '1984-08-12'],
['1988-09-17', '1988-10-02'],
['1992-07-25', '1992-08-09'],
['1996-07-19', '1996-08-04'],
['2000-09-15', '2000-10-01'],
['2004-08-13', '2004-08-29'],
['2008-08-08', '2008-08-24'],
['2012-07-27', '2012-08-12'],
['2016-08-05', '2016-08-21']
]
olympic_dates_df = pd.DataFrame(olympic_dates, columns=['Start', 'End'],
index=[1964,
1968,
1972,
1976,
1980,
1984,
1988,
1992,
1996,
2000,
2004,
2008,
2012,
2016])
olympic_dates_df['Start'] = pd.to_datetime(olympic_dates_df['Start'], format='%Y-%m-%d')
olympic_dates_df['End'] = pd.to_datetime(olympic_dates_df['End'], format='%Y-%m-%d')
top_running.insert(loc=top_running.columns.get_loc('Date'), column='Olympics', value=False)
for y in olympic_dates_df.index:
    top_running.loc[top_running['Year'] == y, 'Olympics'] = (top_running['Year'] == y) & (
        top_running['Date'] >= olympic_dates_df.loc[y, 'Start']) & (
        top_running['Date'] <= olympic_dates_df.loc[y, 'End'])
top_running[top_running['Olympics']== True].head()
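The same labelling can also be done without a loop by looking up each row's Games window by year. The sketch below is an alternative under stated assumptions, not the notebook's method: `windows` holds two of the date ranges listed above, and `toy` is a made-up set of performance dates.

```python
import pandas as pd

# Two Games windows taken from the list of Olympic dates.
windows = pd.DataFrame(
    {'Start': pd.to_datetime(['2012-07-27', '2016-08-05']),
     'End': pd.to_datetime(['2012-08-12', '2016-08-21'])},
    index=[2012, 2016])

# Made-up performance dates with their years.
toy = pd.DataFrame({
    'Date': pd.to_datetime(['2012-08-05', '2012-09-01', '2015-08-10']),
    'Year': [2012, 2012, 2015],
})

# Look up each row's window by year. Years without Games give NaT,
# and comparisons against NaT evaluate to False.
start = toy['Year'].map(windows['Start'])
end = toy['Year'].map(windows['End'])
toy['Olympics'] = (toy['Date'] >= start) & (toy['Date'] <= end)

print(toy)
```

Only the first row falls inside a Games window; the second is in an Olympic year but after the closing date, and the third is in a year without Games.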
It is useful to label the top 10 performances in each event, for each gender, and for every year. This is because the data set includes the top 1000 performances for all events, and since the main concern for this analysis is the factors affecting the improvement of performances, it is worth identifying the top 10 performances in each year.
top_running.insert(loc=top_running.columns.get_loc('Time'), column='Top 10', value=False)
event_categories = ['Track_Flat', 'Road']
for gender in top_running['Gender'].unique().tolist():
    for category in event_categories:
        events = top_running[category].unique().tolist()
        events.remove(0)
        events.sort()
        for event in events:
            for year in top_running[top_running[category] == event]['Year'].unique().tolist():
                # Rows for this event, year and gender
                selection = ((top_running[category] == event) &
                             (top_running['Year'] == year) &
                             (top_running['Gender'] == gender))
                best_times_per_year = top_running[selection]['Time'].tolist()
                if len(best_times_per_year) > 0:
                    if len(best_times_per_year) >= 10:
                        # The 10th best time is the cutoff
                        cutoff = sorted(best_times_per_year)[9]
                    else:
                        # Fewer than 10 results: the slowest time is the cutoff
                        cutoff = sorted(best_times_per_year)[-1]
                    top_running.loc[selection, 'Top 10'] = top_running.loc[selection]['Time'] <= cutoff
                else:
                    continue
Perform a check that this has worked:
top_running[(top_running['Road'] == 42195) & (
top_running['Top 10'] == True) & (
top_running['Year'] == 2012)].head()
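A more compact way to express the same idea is a grouped rank. The sketch below is an alternative, not the notebook's method. It assumes the times have been converted to a numeric column (here a made-up `Seconds` column) and uses the top 2 instead of the top 10 so the toy data stays small.

```python
import pandas as pd

# Made-up times in seconds for one event.
toy = pd.DataFrame({
    'Gender':  ['M', 'M', 'M', 'F', 'F'],
    'Year':    [2012, 2012, 2012, 2012, 2012],
    'Seconds': [9.80, 9.63, 9.75, 10.75, 10.90],
})

n = 2  # the notebook uses 10; 2 keeps the toy example small

# Rank times within each gender/year group, fastest first.
# method='min' gives tied times the same (lowest) rank, so ties at the
# cutoff are all kept, as with the "<= cutoff" comparison.
rank = toy.groupby(['Gender', 'Year'])['Seconds'].rank(method='min')
toy['Top N'] = rank <= n

print(toy)
```

Groups with fewer than `n` results are flagged in full, which matches the behaviour of the loop when fewer than 10 results exist.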
This concludes the processing of this data set, and the data frame will be named top_running from this point on.
The full Olympic data set has information about athlete characteristics but no times or results. Both the track and field data set and the top running times data set have times and results, but no athlete data. So to answer questions about how results and athlete characteristics are related, it is necessary to merge these data sets. Athletes often compete in multiple Olympic Games and in different events, so it will be necessary to find a match based on the year, the event, the medal awarded and the athlete's name. It will be straightforward to match the year across both data sets, and also the events and medals, because the labels are already standardised. The name presents an additional challenge because, for some athletes appearing in both data sets, it is written differently in each. For example, here is how Mo Farah's performance in the 10000 m in 2016 looks in the Olympic track and field data set:
ol_tf.loc[[1]]
Compare the way his name is written to the way it appears for the same performance in the full Olympic data set:
ol_running.loc[[66487]]
Row 66487 contains Mo Farah's performance matching the one in the Olympic track and field data, but the name is written very differently. To overcome this, a method called fuzzy matching will be used.
The aim is to merge the time data into the ol_running data frame, where it is available.
ol_running.head()
Add two columns to ol_running, one for the time and one for the merged-in name, which can be used as a sanity check for the data merging process.
ol_running.insert(loc=len(ol_running.columns), column='Time', value=pd.NaT)
# np.nan is used because the np.NaN alias was removed in NumPy 2.0
ol_running.insert(loc=ol_running.columns.get_loc('RawName'), column='Merged_name', value=np.nan)
ol_running.insert(loc=ol_running.columns.get_loc('RawName'), column='Ratio', value=np.nan)
Now define a function to merge the times from the ol_tf_running data set into the ol_running data set. This function splits the results in each data set into groups by event, year, gender and medal awarded. This cuts the full results set into much smaller and more manageable groups. Every pass of the loop examines a pair of corresponding groups, one from each data set, covering the same event, year, gender and medal awarded. The function then compares the names in each. If the strings don't match in a simple way (using str.find()), it applies the fuzzy matching algorithm to find the best match (process.extractOne()). If the match ratio between the two strings being compared is above a threshold (chosen arbitrarily as 50), the two rows are treated as a match, and the time, name and ratio are saved in the ol_running data set.
def merge_times(df, event_categories, debug=False):
"""
Helper function to merge times from one dataset into the ol_running dataset.
Event, year, gender, medal and athlete name are used as inputs to match athlete data from
one data frame to the same athlete's performance in the other data frame.
Names are matched using fuzzy string matching.
Input parameters:
df - data frame to merge
event_categories - list of categories of events
debug - True/False flag to indicate whether to print out debugging information.
Returns:
None
"""
for category in event_categories:
if debug:
print(category)
ol_running_groups = ol_running.groupby([category, 'Year', 'Gender', 'Medal'])
df_groups = df.groupby([category, 'Year', 'Gender', 'Medal'])
events = ol_running[category].unique().tolist()
events.remove(0)
for event in events:
if debug:
print(event)
for gender in ol_running['Gender'].unique().tolist():
if debug:
print(gender)
for year in ol_running['Year'].unique().tolist():
if debug:
print(year)
for medal in ol_running['Medal'].unique().tolist():
if debug:
print(medal)
try:
group_1 = ol_running_groups.get_group((event, year, gender, medal))
except KeyError:
if debug:
print("No results for this combination in ol_running_groups")
continue
try:
group_2 = df_groups.get_group((event, year, gender, medal))
except KeyError:
if debug:
print("No results for this combination in df_groups")
continue
name_options = group_1['Name'].tolist()
for name in group_2['Name']:
find_result = group_1['Name'].str.find(name)
i = find_result[find_result>-1].index
if debug:
print(i)
if not i.empty:  # i.any() would be False when the only matching row has index label 0
if debug:
print("str.find found a match: {}".format(name))
# Don't replace a time if one has already been found for this row
if pd.isnull(ol_running.loc[i]['Time'].tolist()):
ol_running.loc[i, 'Time'] = group_2.loc[group_2['Name']==name]['Time'].tolist()
else:
if debug:
print("str.find did NOT find a match:")
best_match = process.extractOne(name, name_options)
if debug:
print(best_match)
print("Best name: {}".format(best_match[0]))
print("Match confidence: {}".format(best_match[1]))
print("index={}".format(group_1[group_1['Name']==best_match[0]].index))
if best_match[1] > 50:
i=group_1[group_1['Name']==best_match[0]].index
# Don't replace a time if one has already been found for this row
if pd.isnull(ol_running.loc[i]['Time'].tolist()):
ol_running.loc[i, 'Merged_name'] = name
ol_running.loc[i, 'Ratio'] = best_match[1]
ol_running.loc[i, 'Time'] = group_2.loc[group_2['Name']==name]['Time'].tolist()
event_categories = ['Track_Flat', 'Hurdles', 'Road', 'Steeplechase']
merge_times(ol_tf_running, event_categories)
Check how well the fuzzy matching algorithm is doing:
ol_running.loc[~ol_running['Time'].isnull()].head()
From a visual scan of the three columns corresponding to athlete name, it looks like the fuzzy matching algorithm is doing a good job of finding the correct names. The matching algorithm uses a threshold value of 50 for the match ratio. As a further check, examine the matches with the lowest match ratio:
ol_running[ol_running['Ratio'] < 70]
Only seven matches have a ratio below 70, and they all look correct, so the matching algorithm seems to be working well.
ol_running.loc[~ol_running['Time'].isnull()].info()
So this method has merged in 1177 time data fields. Next, merge in times from the top_running data frame. This is a little more complicated because it is necessary to group on events marked as True in the 'Olympics' feature of this data frame, to screen out other performances by the same athlete in the same year. In addition, some athletes may run several heats and a final in a single Games. Therefore, it is necessary to use the time from the race with the latest date within the period of the Games in question. If there is more than one such time (e.g., if a final and a heat were run on the same day), one is chosen arbitrarily. This is not a huge problem, since the aim is to relate performances to height, weight and age, and those factors will not change within one day anyway.
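The "latest Olympic race per athlete" rule can be illustrated on a toy data frame. This is a sketch under stated assumptions: the column names 'Name', 'Date' and 'Olympics' follow top_running, but the rows are made up, 'Time_s' (seconds as a float) is a hypothetical simplification of the 'Time' column, and `latest_olympic_result` is a hypothetical helper rather than part of the notebook.

```python
import pandas as pd

# Made-up performances; only the column names mirror top_running
races = pd.DataFrame({
    'Name': ['Runner A', 'Runner A', 'Runner A', 'Runner B'],
    'Date': pd.to_datetime(['2012-08-04', '2012-08-05', '2012-09-01', '2012-08-05']),
    'Time_s': [9.98, 9.90, 9.85, 10.02],
    'Olympics': [True, True, False, True],
})


def latest_olympic_result(df):
    """Keep, for each athlete, the Olympic performance with the latest date."""
    olympic = df[df['Olympics']]          # screen out non-Olympic races
    return (olympic.sort_values('Date')   # oldest first
                   .drop_duplicates(subset='Name', keep='last')
                   .set_index('Name'))


latest = latest_olympic_result(races)
print(latest)
```

Runner A's faster time from September is ignored because it was not run at the Games, and the heat on 4 August is superseded by the race on 5 August.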
The ol_tf_running data set contained only medal-winning performances, so it was (almost) guaranteed that there would be a corresponding row in the ol_running data set. The top_running data set differs from the ol_tf_running data set in that it contains many non-medal winning performances. Therefore, it's not possible to use the 'Medal' field to group the performances and use that to help match them. This means there is a larger scope for false positives, where the fuzzy matching algorithm wrongly identifies two similar names as a match. To help solve this, the match ratio threshold is raised from 50 to 80 in this function. It's not straightforward to combine this extra complexity into the existing merge_times function, so write a new function to handle this.
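The effect of raising the threshold can be seen with two different athletes who share a surname. As a hedged illustration only: the names are made up, `score` is a hypothetical helper, and `difflib.SequenceMatcher` is used as a stand-in for the fuzzywuzzy scorer, so the actual ratios in the notebook would differ.

```python
from difflib import SequenceMatcher


def score(a, b):
    """0-100 similarity score from difflib (not the fuzzywuzzy scorer)."""
    return 100 * SequenceMatcher(None, a, b).ratio()


# Two different (made-up) athletes with the same surname
s = score("Tomas Novak", "Jakub Novak")
accepted_at_50 = s > 50   # the threshold used in merge_times()
accepted_at_80 = s > 80   # the stricter threshold used in merge_times_ext()
print(round(s), accepted_at_50, accepted_at_80)
```

With the 'Medal' grouping available, such a pair would rarely land in the same group; without it, the stricter threshold has to do that screening.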
def merge_times_ext(df, event_categories, debug=False):
"""
Helper function to merge times from one dataset into the ol_running dataset.
Event, year, gender, and athlete name are used as inputs to match athlete data from
one data frame to the same athlete's performance in the other data frame.
Names are matched using fuzzy string matching.
Input parameters:
df - data frame to merge
event_categories - list of categories of events
debug - True/False flag to indicate whether to print out debugging information.
Returns:
None
"""
for category in event_categories:
if debug:
print(category)
ol_running_groups = ol_running.groupby([category, 'Year', 'Gender'])
df_groups = df.groupby([category, 'Year', 'Gender', 'Olympics'])
events = ol_running[category].unique().tolist()
events.remove(0)
for event in events:
if debug:
print(event)
for gender in ol_running['Gender'].unique().tolist():
if debug:
print(gender)
for year in ol_running['Year'].unique().tolist():
if debug:
print(year)
try:
group_1 = ol_running_groups.get_group((event, year, gender))
except KeyError:
if debug:
print("No results for this combination in ol_running_groups")
continue
try:
group_2 = df_groups.get_group((event, year, gender, True))
except KeyError:
if debug:
print("No results for this combination in df_groups")
continue
name_options = group_1['Name'].tolist()
for name in group_2['Name']:
find_result = group_1['Name'].str.find(name)
i = find_result[find_result>-1].index
if debug:
print(i)
if not i.empty:  # i.any() would be False when the only matching row has index label 0
if debug:
print("str.find found a match: {}".format(name))
# Don't replace a time if one has already been found for this row
if pd.isnull(ol_running.loc[i]['Time'].tolist()):
latest_race_date = group_2.loc[(group_2['Name']==name, 'Date')].max()
ol_running.loc[i, 'Time'] = group_2.loc[
(group_2['Name'] == name) &
(group_2['Date'] == latest_race_date)]['Time'].tolist()[0]
else:
if debug:
print("str.find did NOT find a match:")
best_match = process.extractOne(name, name_options)
if debug:
print(best_match)
print("Best name: {}".format(best_match[0]))
print("Match confidence: {}".format(best_match[1]))
print("index={}".format(group_1[group_1['Name']==best_match[0]].index))
if best_match[1] > 80:
i=group_1[group_1['Name']==best_match[0]].index
# Don't replace a time if one has already been found for this row
if pd.isnull(ol_running.loc[i]['Time'].tolist()):
ol_running.loc[i, 'Merged_name'] = name
ol_running.loc[i, 'Ratio'] = best_match[1]
latest_race_date = group_2.loc[(group_2['Name']==name, 'Date')].max()
ol_running.loc[i, 'Time'] = group_2.loc[
(group_2['Name'] == name) &
(group_2['Date'] == latest_race_date)]['Time'].tolist()[0]
event_categories = ['Track_Flat', 'Road']
merge_times_ext(top_running, event_categories)
ol_running.loc[~ol_running['Time'].isnull()].head()
ol_running.loc[~ol_running['Time'].isnull()].info()
Merging the second data set in has increased the number of rows with a time by a few hundred.
To investigate this, plot athletes' results (i.e. times) against the date of the performance. This will be done individually for each event and separately for each gender. The plots use colour to identify Olympic medal winning performances (see the key). Both the ol_tf_running and top_running datasets are used for this. The analysis will follow in section 5.1.
# Group by gender
top_running_gender_groups = top_running.groupby(['Gender', 'Top 10'])
top_running_m = top_running_gender_groups.get_group(('M', True))
top_running_f = top_running_gender_groups.get_group(('F', True))
ol_tf_running_gender_groups = ol_tf_running.groupby('Gender')
ol_tf_running_m = ol_tf_running_gender_groups.get_group('M')
ol_tf_running_f = ol_tf_running_gender_groups.get_group('F')
def build_graph_labels(gender, category, event, characteristic=None):
"""
build_graph_labels
Helper function to create strings to use in constructing the graph title
Input parameters:
gender - athlete gender group
category - type of event
event - specific distance
characteristic - athlete characteristic, default None
Returns:
gender_label - Readable gender string
event_label - Readable event name string
category_label - Readable event category string
unit - Unit for the characteristic to plot
"""
if gender=='M':
gender_label = "Male"
else:
gender_label = "Female"
if category == 'Road':
if event == 42195:
event_label = 'Marathon'
if event == 21098:
event_label = 'Half Marathon'
category_label = 'Road Running'
if category == 'Track_Flat':
event_label = str(event)+'m'
category_label = 'Track (Flat)'
if category == 'Hurdles':
event_label = str(event)+'m'+' Hurdles'
category_label = 'Hurdles'
if category == 'Steeplechase':
event_label = str(event)+'m'+' Steeplechase'
category_label = 'Steeplechase'
if characteristic == 'Height':
unit='cm'
elif characteristic == 'Weight':
unit='kg'
elif characteristic == 'Age':
unit='years'
elif characteristic == 'BMI':
unit='kg/m^2'
else:
unit=None
return gender_label, event_label, category_label, unit
def update_min_max(x_series, y_series, x_min, x_max, y_min, y_max):
"""
Helper to update minimum and maximum values of the x and y series
Input parameters:
x_series - A datetime.date series
y_series - A datetime.time list
x_min - earliest date found so far
x_max - latest date found so far
y_min - shortest time found so far
y_max - longest time found so far
Returns:
x_min - Earliest date
x_max - Latest date
y_min - Smallest time
y_max - Largest time
"""
if x_series.min() < x_min:
x_min = x_series.min()
if x_series.max() > x_max:
x_max = x_series.max()
if min(y_series) < y_min:
y_min = min(y_series)
if max(y_series) > y_max:
y_max = max(y_series)
return x_min, x_max, y_min, y_max
def plot_times(event_categories, debug=False):
"""
Helper function to plot finish times for athletes across all events.
Input parameters:
event_categories - list of categories of events
debug - True/False flag to indicate whether to print out debugging information.
Returns:
None
"""
global graph_number
summary_strings = []
x_min = datetime(2020, 1, 1, 0, 0, 0, 0)
x_max = datetime(1896, 1, 1, 0, 0, 0, 0)
y_min = time(23, 0, 0)
y_max = time(0, 0, 0)
top_running_data_present = True
ol_tf_running_data_present = True
for category in event_categories:
events = ol_running[category].unique().tolist()
events.remove(0)
events.sort()
if category == 'Road':
events.append(21098)
for event in events:
if debug:
print("Category = {}, Event={}".format(category, event))
for gender in ol_tf_running_gender_groups.groups.keys():
if gender == 'M':
top_running_group = top_running_m
ol_tf_running_group = ol_tf_running_m
else:
top_running_group = top_running_f
ol_tf_running_group = ol_tf_running_f
plt.figure(figsize=(18, 9))
# Plot top running time data, if it exists
try:
x_series = top_running_group[top_running_group[category] == event]['Date']
y_series = list(top_running_group[top_running_group[category] == event]['Time'])
plt.scatter(x_series, y_series, color='b', label='Top 10 results in year')
x_min, x_max, y_min, y_max = update_min_max(x_series, y_series,
x_min, x_max, y_min, y_max)
except KeyError:
if debug:
print("No data from top running times for this event.")
top_running_data_present = False
# Plot each olympic medal colour, if data exists for this event
if ol_tf_running_group[(ol_tf_running_group[category] == event)].shape[0] != 0:
x_series = pd.to_datetime(ol_tf_running_group[(ol_tf_running_group[category] == event) &
(ol_tf_running_group['Medal'] == 'G')]['Year'],
format='%Y')
y_series = list(ol_tf_running_group[(ol_tf_running_group[category] == event) &
(ol_tf_running_group['Medal'] == 'G')]['Time'])
plt.scatter(x_series, y_series, color='gold', label='Olympic gold medal')
x_min, x_max, y_min, y_max = update_min_max(x_series, y_series,
x_min, x_max, y_min, y_max)
x_series = pd.to_datetime(ol_tf_running_group[(ol_tf_running_group[category] == event) &
(ol_tf_running_group['Medal'] == 'S')]['Year'],
format='%Y')
y_series = list(ol_tf_running_group[(ol_tf_running_group[category] == event) &
(ol_tf_running_group['Medal'] == 'S')]['Time'])
plt.scatter(x_series, y_series, color='silver', label='Olympic silver medal')
x_min, x_max, y_min, y_max = update_min_max(x_series, y_series,
x_min, x_max, y_min, y_max)
x_series = pd.to_datetime(ol_tf_running_group[(ol_tf_running_group[category] == event) &
(ol_tf_running_group['Medal'] == 'B')]['Year'],
format='%Y')
y_series = list(ol_tf_running_group[(ol_tf_running_group[category] == event) &
(ol_tf_running_group['Medal'] == 'B')]['Time'])
plt.scatter(x_series, y_series, color='brown', label='Olympic bronze medal')
x_min, x_max, y_min, y_max = update_min_max(x_series, y_series,
x_min, x_max, y_min, y_max)
else:
ol_tf_running_data_present = False
if debug:
print("No data from Olympic data set for this event.")
# Only plot if there is some data
if (ol_tf_running_data_present == True) or (top_running_data_present == True):
# Construct the graph title and axis labels
gender_label, event_label, category_label, unit = build_graph_labels(
gender, category, event)
plt.xlabel('Year')
plt.ylabel('Time')
plt.title("Graph {0}: Variation in {1} {2} Times".format(graph_number,
gender_label,
event_label))
plt.legend()
plt.show()
graph_number+=1
# Print some information about the variation in results in this graph
# Calculate difference between quickest and slowest times in the graph
y_delta = datetime.combine(
date.today(), y_max) - datetime.combine(
date.today(), y_min)
# Calculate difference as a proportion of the quickest time (y_min)
y_proportion = y_delta.total_seconds() / (datetime.combine(date.today(), y_min) -
datetime.combine(date.today(),
time(0, 0, 0))).total_seconds()
# Calculate length of time we have plotted data over
x_delta = x_max - x_min
summary_strings.append("Top times in {0} {1} span a range of {2} seconds ({3:.2f}%) in {4:.1f} years of history".format(
gender_label,
event_label,
y_delta.total_seconds(),
y_proportion*100,
x_delta.days/365))
print("Top times in {0} {1} span a range of {2} seconds ({3:.2f}%) in {4:.1f} years of history".format(
gender_label,
event_label,
y_delta.total_seconds(),
y_proportion*100,
x_delta.days/365))
# Reset
top_running_data_present = True
ol_tf_running_data_present = True
x_min = datetime(2020, 1, 1, 0, 0, 0, 0)
x_max = datetime(1896, 1, 1, 0, 0, 0, 0)
y_min = time(23, 0, 0)
y_max = time(0, 0, 0)
print("Summary")
for summary in summary_strings:
print(summary)
# A label for the graphs plotted
graph_number = 1
# Plot graphs for all events
event_categories = ['Track_Flat', 'Steeplechase', 'Hurdles', 'Road']
plot_times(event_categories)
The analysis for the question "How have athletes' performances changed through history?" can be found in section 5.1.
This will be examined by plotting the mean value of each of the four athlete characteristics (height, weight, age, BMI) against year of competition. Multiple events are plotted on the same axis for ease of comparison. The NumPy polyfit() function is used to fit a straight best fit line for each event.
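As a minimal sketch of the best fit step, the following fits a degree-1 polynomial to made-up yearly means after masking out missing values. The numbers are hypothetical and chosen to lie on a straight line; a boolean mask is used here as one simple way to drop NaN values before calling polyfit().

```python
import numpy as np

# Made-up yearly means; the NaN represents a Games with no recorded heights
years = np.array([1996, 2000, 2004, 2008, 2012], dtype=float)
mean_height = np.array([180.0, np.nan, 181.0, 181.5, 182.0])

# polyfit() cannot handle NaN, so keep only the valid pairs
valid = ~np.isnan(mean_height)
slope, intercept = np.polyfit(years[valid], mean_height[valid], 1)

# poly1d turns the coefficients into a callable trend line
trend = np.poly1d([slope, intercept])
print(slope, trend(2000))
```

The sign and size of `slope` (units per year) summarise whether the characteristic is rising or falling for that event.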
The analysis can be found in section 5.2
def plot_athlete_characteristics(event_categories, characteristics, debug=False):
"""
Helper function to plot athlete characteristics (height, weight, age, BMI) against year of competition.
Multiple events are plotted on the same axis for comparison.
A best fit line is added for each event in the plot.
Input parameters:
event_categories - list of categories of events
characteristics - list of athlete characteristics to plot
debug - True/False flag to indicate whether to print out debugging information.
Returns:
None
"""
meanvals = []
years = []
global graph_number
for c in characteristics:
for category in event_categories:
if debug:
print(category)
ol_running_groups = ol_running.groupby([category, 'Gender'])
events = ol_running[category].unique().tolist()
events.remove(0)
for gender in ol_running['Gender'].unique().tolist():
plt.figure(figsize=(18, 9))
if debug:
print(gender)
for event in events:
if debug:
print(event)
try:
ol_running_group = ol_running_groups.get_group((event, gender))
except KeyError:
if debug:
print("No results for this combination in ol_running_groups")
continue
for year in ol_running_group['Year'].unique().tolist():
meanvals.append(ol_running_group[ol_running_group['Year'] == year][c].mean())
years.append(year)
plt.scatter(years, meanvals, label=str(event)+'m')
# Remove NaN values - these will break the fit used by polyfit() below
nullvals = np.isnan(meanvals)
for i in sorted(np.where(nullvals)[0], reverse=True):  # pop from the end so earlier indices stay valid
meanvals.pop(i)
years.pop(i)
z = np.polyfit(years, meanvals, 1)
p = np.poly1d(z)
plb.plot(years, p(years))
meanvals = []
years = []
#Construct the graph title and axis labels
gender_label, event_label, category_label, unit = build_graph_labels(gender, category, event, c)
plt.xlabel('Year')
plt.ylabel('{} ({})'.format(c, unit))
plt.title("Graph {0}: Variation in Mean {1} of {2} {3} Olympic Athletes Through History".format(
graph_number, c, gender_label, category_label))
plt.legend()
plt.show()
graph_number+=1
Calculate body mass index (BMI):
ol_running.insert(loc=ol_running.columns.get_loc('Weight'), column='BMI', value=0)
ol_running['BMI'] = ol_running['Weight'] / ((ol_running['Height'] / 100)**2)
ol_running.head()
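For reference, the BMI calculation used above is weight in kilograms divided by the square of height in metres. The helper below is a hypothetical stand-alone version with a made-up athlete; it is only meant to make the unit conversion from centimetres explicit.

```python
def bmi(weight_kg, height_cm):
    """Body mass index in kg/m^2 from weight in kg and height in cm."""
    height_m = height_cm / 100  # the data set stores height in centimetres
    return weight_kg / height_m ** 2


# Made-up athlete: 70 kg and 175 cm
print(round(bmi(70, 175), 2))
```

In the pandas version, rows with a missing height or weight produce NaN for BMI, so they drop out of the mean values plotted later.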
characteristics = ['Height', 'Weight', 'Age', 'BMI']
plot_athlete_characteristics(event_categories, characteristics)
The analysis can be found in section 5.2
This will be examined by plotting each of the four athlete characteristics against their performance (i.e., time) for each event and gender. Colours are used to indicate medal-winning performances as indicated by the key.
The analysis can be found in section 5.3
ol_running_gender_groups = ol_running.groupby('Gender')
ol_running_m = ol_running_gender_groups.get_group('M')
ol_running_f = ol_running_gender_groups.get_group('F')
def plot_time_vs_characteristics(event_categories, characteristics, debug=False):
"""
Helper function to plot athlete characteristics (height, weight, age, BMI) against time.
Input parameters:
event_categories - list of categories of events
characteristics - list of athlete characteristics to plot
debug - True/False flag to indicate whether to print out debugging information.
Returns:
None
"""
global graph_number
data_present = True
for c in characteristics:
for category in event_categories:
events = ol_running[category].unique().tolist()
events.remove(0)
events.sort()
for event in events:
if debug:
print("Category = {}, Event={}".format(category, event))
for gender in ol_running_gender_groups.groups.keys():
if gender == 'M':
ol_running_group = ol_running_m
else:
ol_running_group = ol_running_f
plt.figure(figsize=(18, 9))
# Plot each olympic medal colour, if data exists for this event
if ol_running_group[(ol_running_group[category] == event)].shape[0] != 0:
# Plot vertical lines showing mean and standard deviation of the characteristic
meanval = ol_running_group[(ol_running_group[category] == event) &
(ol_running_group['Time'].notna())][c].mean()
stddev = ol_running_group[(ol_running_group[category] == event) &
(ol_running_group['Time'].notna())][c].std()
representative_max_time = max(list(ol_running_group[(ol_running_group[category] == event) &
(ol_running_group['Medal'] == 'B') &
(ol_running_group['Time'].notna())]['Time']))
representative_min_time = min(list(ol_running_group[(ol_running_group[category] == event) &
(ol_running_group['Medal'] == 'B') &
(ol_running_group['Time'].notna())]['Time']))
plt.plot([meanval, meanval],
[representative_min_time, representative_max_time],
color='red', label='Mean value of characteristic')
plt.plot([meanval+stddev, meanval+stddev],
[representative_min_time, representative_max_time],
color='pink', label='Mean +/- standard deviation')
plt.plot([meanval-stddev, meanval-stddev],
[representative_min_time, representative_max_time],
color='pink', label='_nolegend_')  # leading underscore keeps this line out of the legend (avoids a duplicate entry)
plt.scatter(ol_running_group[(ol_running_group[category] == event) &
(ol_running_group['Medal'] == 'G') &
(ol_running_group['Time'].notna())][c],
list(ol_running_group[(ol_running_group[category] == event) &
(ol_running_group['Medal'] == 'G') &
(ol_running_group['Time'].notna())]['Time']),
color='gold', label='Olympic gold medal')
plt.scatter(ol_running_group[(ol_running_group[category] == event) &
(ol_running_group['Medal'] == 'S') &
(ol_running_group['Time'].notna())][c],
list(ol_running_group[(ol_running_group[category] == event) &
(ol_running_group['Medal'] == 'S') &
(ol_running_group['Time'].notna())]['Time']),
color='silver', label='Olympic silver medal')
plt.scatter(ol_running_group[(ol_running_group[category] == event) &
(ol_running_group['Medal'] == 'B') &
(ol_running_group['Time'].notna())][c],
list(ol_running_group[(ol_running_group[category] == event) &
(ol_running_group['Medal'] == 'B') &
(ol_running_group['Time'].notna())]['Time']),
color='brown', label='Olympic bronze medal')
plt.scatter(ol_running_group[(ol_running_group[category] == event) &
(ol_running_group['Medal'].isnull()) &
(ol_running_group['Time'].notna())][c],
list(ol_running_group[(ol_running_group[category] == event) &
(ol_running_group['Medal'].isnull()) &
(ol_running_group['Time'].notna())]['Time']),
color='blue', label='No medal')
else:
data_present = False
if debug:
print("No data from Olympic data set for this event.")
# Only plot if there is some data
if data_present == True:
# Construct the graph title and axis labels
gender_label, event_label, category_label, unit = build_graph_labels(
gender, category, event, c)
plt.xlabel('{} ({})'.format(c, unit))
plt.ylabel('Time')
plt.title("Graph {0}: Variation in {1} {2} Times with {3}".format(
graph_number, gender_label, event_label, c))
plt.legend()
plt.show()
graph_number+=1
data_present = True
# Plot graphs for all events
event_categories = ['Track_Flat', 'Steeplechase', 'Hurdles', 'Road']
plot_time_vs_characteristics(event_categories, characteristics)
Now display the same set of results, colour coded to show the 20-year time period they fall into. This is to look for any relationship between the characteristic and the results within a particular year group.
# Create a set of year groups throughout Olympic history
year_groups = np.arange(1896, 2020, 20)
year_groups[-1] += 1 # To include 2016 Games
year_groups
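To make the bucketing explicit, the sketch below rebuilds the same period edges with plain Python and maps a year to its period label. `period_label` is a hypothetical helper, not part of the notebook; it uses the same half-open intervals (start inclusive, end exclusive) as the plotting function that follows.

```python
from bisect import bisect_right

# Same edges as year_groups: 20-year steps from 1896, last edge bumped to include 2016
edges = list(range(1896, 2020, 20))
edges[-1] += 1


def period_label(year):
    """Hypothetical helper: label of the period that contains the given year."""
    if not edges[0] <= year < edges[-1]:
        raise ValueError("year {} is outside the range covered".format(year))
    k = bisect_right(edges, year) - 1  # index of the period's starting edge
    return '{0} to {1}'.format(edges[k], edges[k + 1] - 1)


print(edges)
print(period_label(1896), '|', period_label(1995), '|', period_label(2016))
```

Without the adjustment to the last edge, the 2016 Games would fall outside every period and would not be plotted.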
def plot_time_vs_characteristics_time_groups(event_categories, characteristics, debug=False):
"""
Helper function to plot athlete characteristics (height, weight, age, BMI) against time.
This function uses colour codes to show the 20-year time period into which a performance falls.
Input parameters:
event_categories - list of categories of events
characteristics - list of athlete characteristics to plot
debug - True/False flag to indicate whether to print out debugging information.
Returns:
None
"""
global graph_number
data_present = True
for c in characteristics:
for category in event_categories:
events = ol_running[category].unique().tolist()
events.remove(0)
events.sort()
for event in events:
if debug:
print("Category = {}, Event={}".format(category, event))
for gender in ol_running_gender_groups.groups.keys():
if gender == 'M':
ol_running_group = ol_running_m
else:
ol_running_group = ol_running_f
plt.figure(figsize=(18, 9))
# Plot
if ol_running_group[(ol_running_group[category] == event)].shape[0] != 0:
for y in range(len(year_groups)-1):
plt.scatter(ol_running_group[(ol_running_group[category] == event) &
(ol_running_group['Year'] >= year_groups[y]) &
(ol_running_group['Year'] < year_groups[y+1]) &
(ol_running_group['Time'].notna())][c],
list(ol_running_group[(ol_running_group[category] == event) &
(ol_running_group['Year'] >= year_groups[y]) &
(ol_running_group['Year'] < year_groups[y+1]) &
(ol_running_group['Time'].notna())]['Time']),
label='{0} to {1}'.format(year_groups[y], year_groups[y+1]-1))
# Calculate the mean value of the characteristic based on the most recent 20 year period
meanval = ol_running_group[(ol_running_group[category] == event) &
(ol_running_group['Year'] >= year_groups[y]) &
(ol_running_group['Year'] < year_groups[y+1]) &
(ol_running_group['Time'].notna())][c].mean()
stddev = ol_running_group[(ol_running_group[category] == event) &
(ol_running_group['Year'] >= year_groups[y]) &
(ol_running_group['Year'] < year_groups[y+1]) &
(ol_running_group['Time'].notna())][c].std()
representative_max_time = max(list(ol_running_group[(ol_running_group[category] == event) &
(ol_running_group['Year'] >= year_groups[y]) &
(ol_running_group['Year'] < year_groups[y+1]) &
(ol_running_group['Time'].notna())]['Time']))
representative_min_time = min(list(ol_running_group[(ol_running_group[category] == event) &
(ol_running_group['Year'] >= year_groups[y]) &
(ol_running_group['Year'] < year_groups[y+1]) &
(ol_running_group['Time'].notna())]['Time']))
plt.plot([meanval, meanval],
[representative_min_time, representative_max_time],
color='brown', label='Mean value of characteristic, 1996-2016')
plt.plot([meanval+stddev, meanval+stddev],
[representative_min_time, representative_max_time],
color='brown', label='Mean value +/- standard deviation, 1996-2016')
plt.plot([meanval-stddev, meanval-stddev],
[representative_min_time, representative_max_time],
color='brown', label='_nolegend_')  # leading underscore keeps this line out of the legend (avoids a duplicate entry)
else:
data_present = False
if debug:
print("No data from Olympic data set for this event.")
# Only plot if there is some data
if data_present == True:
# Construct the graph title and axis labels
gender_label, event_label, category_label, unit = build_graph_labels(
gender, category, event, c)
plt.xlabel('{} ({})'.format(c, unit))
plt.ylabel('Time')
plt.title("Graph {0}: Variation in {1} {2} Times with {3}".format(
graph_number, gender_label, event_label, c))
plt.legend()
plt.show()
graph_number+=1
data_present = True
plot_time_vs_characteristics_time_groups(event_categories, characteristics)
The analysis can be found in section 5.3
This is the analysis for the results in section 4.1.
Graphs 1-24 show that top performances have improved for almost every event and gender combination. The only exception is the women's 3000m steeplechase, but that is because the data is very limited (spanning only 8 years). All other events show an improvement over time.
The biggest proportional improvement is 85% in the men's marathon. But compare this to the improvement in the men's half marathon: a much smaller 4.2%. This illustrates an important point: the men's marathon has 121 years of data, but the half marathon has only the last 31 years of data. The difference in the improvement for the men's marathon and the half marathon is partly because the rate of improvement for every event was much steeper prior to about 1960. Since then, results have improved much more slowly. This can also be seen in the shape of the graphs, which often have a shape like an exponential decay curve.
Many of the men's events have improved in the range of 20-30% throughout the data set we have available. There is no clear relationship between the distance run and the proportional improvement. For example, two of the sprint events (100m and 200m) have improved by 31.5% and 19.3% respectively, and two of the longer track events (5000m and 10000m) have improved by 20.6% and 29.0% respectively. So the range of improvement is similar at the two ends of the track running distances.
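The percentages quoted in this section are the ones printed by plot_times(), which takes the spread between the slowest and quickest times in a graph and divides it by the quickest time. A minimal sketch of that arithmetic follows; `spread` is a hypothetical helper and the two times are made up. Because datetime.time objects cannot be subtracted directly, each is first combined with an arbitrary date.

```python
from datetime import date, datetime, time


def spread(slowest, quickest):
    """Return (range in seconds, range as a fraction of the quickest time)."""
    day = date(2000, 1, 1)  # arbitrary anchor date, only needed for the subtraction
    midnight = datetime.combine(day, time(0, 0, 0))
    delta = datetime.combine(day, slowest) - datetime.combine(day, quickest)
    quickest_s = (datetime.combine(day, quickest) - midnight).total_seconds()
    return delta.total_seconds(), delta.total_seconds() / quickest_s


# Made-up example: slowest 12 s, quickest 10 s
seconds, fraction = spread(time(0, 0, 12), time(0, 0, 10))
print(seconds, fraction * 100)
```

Since the denominator is the quickest time, the figure describes how much slower the slowest plotted result is than the best one, and events with a longer history naturally show a larger spread.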
The decrease in the rate of improvement since about 1960 also explains why the proportional improvement in women's events is often less than that of the equivalent men's event. Women have only been allowed to participate at all distances relatively recently, so there is much less data available and it does not stretch back in time as far as for the men's events. However, where the length of history is comparable for each gender, it's clear that women and men are improving at a comparable rate. A good example is the half marathon, which has 34.5 years of history for the women and about 31 years for the men (as noted above). The proportional improvements are similar: 7.37% for women, 4.22% for men.
Many of the women's events have improved in the range of 5-20%. The events with the longest history tend to be the ones which show the largest improvement, for the reasons already discussed.
This is the analysis for the results in section 4.2. Multiple events are shown on the same plot, and a best fit trend line is shown for each event.
For the track (flat) events, both male and female athletes show a similar relationship between the event distance and athlete height.
All the female flat track athletes have got slightly taller, with 1500m showing the biggest increase.
For men, 400m, 200m and 800m athletes have tended to get taller. 100m athletes have got shorter. For the other distances, the data doesn't go back very far, so from the data available these remaining heights seem unchanged through history.
There is insufficient data to draw any conclusions about the women's steeplechase. The men's event shows a clear increase in height.
Mean height increased for both women and men in all hurdles events. For men, the longer distance hurdlers (400m) tend to be taller than the shorter distances, but the reverse is true for women. The hurdlers tend to be slightly taller than athletes at their flat track equivalent distances.
Male marathoners show a very slight tendency to increase in mean height over time, and female marathoners show no change (though with a small range of available data). The marathon runners of both genders have similar height to the 10000m track runners.
For both male and female athletes there is a relationship between weight and distance of the event.
There is insufficient data to draw any conclusions about the women's steeplechase. There is no change shown in mean weight of male athletes, and this is at a similar level to male steeplechase or 5000m runners.
Mean weight increased for both women and men in all hurdles events. For men, the short distance hurdlers tend to be heavier than the longer distance hurdlers, but the reverse is true for women. The hurdlers tend to be heavier than athletes at their flat track equivalent distances, except for female 400m hurdlers and flat runners, who have similar mean weights.
The male and female marathoners have similar weights to the equivalent 5000m and 10000m athletes. The mean weights for both remain fairly constant through history.
(Side note about the men's results (Graph 40): there is one suspicious-looking point for the 1896 Games, of an anomalously heavy athlete. This is indeed a valid data point, a competitor who weighed 106 kg. He finished 6th out of 17 athletes in the 1896 Olympic marathon, although sadly we do not know his time. He must have been quite an unusual long distance athlete with this weight. Source: https://en.wikipedia.org/wiki/Dimitrios_Deligiannis.)
For every event and gender, mean age of the competitors has increased through history.
For both genders, there is a general trend that runners of the longer distances tend to have a higher age than those of the shorter distances.
There is insufficient data to draw any conclusions about the women's steeplechase. The mean age of the male competitors is similar to that of the 5000m flat track athletes and has stayed fairly constant.
The mean ages for both distances of hurdling, and for both genders, are similar, and also similar to the equivalent flat track distances.
The mean age of marathon runners of both genders is slightly higher than that of any of the long distance track athletes.
There is a very clear link between the distance of the race and mean athlete BMI: the shorter the distance, the higher the BMI. This is true for both genders. In addition, Graphs 49 and 50 show an approximately diverging set of lines, meaning that the mean BMI of shorter distance athletes tends to increase through history, whilst it tends to decrease for the longer distance events. The separation between events where BMI is increasing and decreasing comes between 400m and 800m.
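The "diverging lines" observation amounts to the trend slopes having opposite signs for short and long events. The sketch below shows that comparison with a hand-written least-squares slope; the `slope` helper is hypothetical and the BMI values are made up purely to illustrate the calculation, not taken from the graphs.

```python
def slope(xs, ys):
    """Least-squares slope of ys against xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


# Made-up mean BMI values per Games, for illustration only
years = [1980, 1992, 2004, 2016]
mean_bmi = {
    '100m': [22.0, 22.6, 23.2, 23.8],
    '10000m': [20.4, 20.1, 19.8, 19.5],
}

trend = {event: ('increasing' if slope(years, values) > 0 else 'decreasing')
         for event, values in mean_bmi.items()}
print(trend)
```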
There is insufficient data to draw any conclusions about the women's steeplechase. For men, mean BMI is tending to decrease through history, and is similar to the mean BMI of 5000m athletes.
Mean BMI has tended to increase for all events and genders except male 400m hurdles, where it has stayed fairly constant. Female 100m hurdlers have similar mean BMI to female 100m flat track athletes, but female 400m hurdlers have a lower BMI than female 400m flat track athletes. Male hurdlers have very similar mean BMI to the equivalent male flat track athletes.
Mean BMIs have tended to show a slight decrease through history. The mean BMIs of marathoners of both genders are similar to, or slightly higher than, the mean BMI of 10000m and 5000m track runners.
(Side note: Graph 56 shows the same anomalous (but real) data point for the result in 1896. See the explanation in section 5.2.2.4.)
This is the analysis for the results in section 4.3.
This set of graphs looks for a relationship between any of the four athlete characteristics and their performance.
For height, most of the results tend to cluster near the mean height (bell curve shape). This includes both good and less good results, medal winners and non-medal winners, so really this is just reflecting the distribution of heights of the athletes for a particular event - in other words, there are more athletes with heights close to the mean for that particular event.
Results plotted against weight also often cluster near the mean in a bell curve shape. There are some cases where the curve is more skewed. For example, Graph 80 shows more results above the mean value than below, but fewer and more consistently good results above the mean value of weight. This reflects the fact observed in the previous section that short distance athletes have tended to get heavier through history, and this has accompanied the improvement in performance.
For age, the results also tend to cluster about the mean. However, as with weight, the results sometimes follow a more skewed distribution, with the largest number of results below the mean age, and more consistently good results at higher age. This is clearer in the cases where more data points are available. This also partly reflects the changes in athlete population: section 5.2.3. noted that athletes at all events have tended to get older through history. We can observe that as mean athlete age has increased, athlete performance has also improved.
The BMI graphs have slightly different distributions for short distance, middle distance and long distance events.
This is consistent with section 5.2.4, which showed that BMIs for short distances have tended to increase, for middle distances they have stayed fairly constant, and for long distances they have tended to decrease. This has accompanied an improvement in results for all events.
It's clear that elite running results have improved and some athlete characteristics have changed in certain events through history. To attempt to compare like with like, these graphs show the results in groups falling into 20 year periods. In particular, we are interested in the most recent results (from 1996-2016). The intention is that this will mitigate the effect of factors that may influence performance, such as training techniques, improved health, improved facilities etc., so that we can isolate the effect of height, weight, age and BMI.
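One way to form such groups is with `pd.cut`, as in the sketch below. The bin edges are an assumption (periods starting at 1896, with the final period stretched to include 2016), and the column names `Year` and `Time_s` and all values are hypothetical.

```python
import pandas as pd

# Hypothetical results for a single event; times are invented for illustration.
results = pd.DataFrame({
    "Year": [1896, 1920, 1948, 1972, 1995, 1996, 2008, 2016],
    "Time_s": [12.5, 11.4, 10.9, 10.6, 10.4, 10.3, 10.1, 10.2],
})

# Assumed period boundaries: left-closed bins, last bin covers 1996-2016.
edges = [1896, 1916, 1936, 1956, 1976, 1996, 2017]
labels = [f"{lo}-{hi - 1}" for lo, hi in zip(edges[:-1], edges[1:])]

results["Period"] = pd.cut(results["Year"], bins=edges, labels=labels, right=False)

# Spread of results within each period: a narrower range in the most recent
# period means the performances being compared are closer to like with like.
period_summary = results.groupby("Period", observed=True)["Time_s"].agg(["min", "max", "mean"])
period_summary["range"] = period_summary["max"] - period_summary["min"]
print(period_summary)

recent = results[results["Period"] == "1996-2016"]
```

Filtering on the last label, as with `recent` above, gives the subset of modern results on which the characteristics can then be compared.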
It's clear that the most recent results occupy a much narrower band of times, from the minimum to the maximum, and the mean result will certainly have improved. The range of each characteristic (height, weight, age, BMI) is still very broad even in this modern period. So although we can observe that certain characteristics have altered through history, and this has accompanied an improvement in performance, athletes with a wide range of characteristics are still able to give excellent performances. The top performances are widely spread, covering a range of one to two standard deviations either side of the mean.
Performances have improved across all events and genders. This reflects improvements in training techniques, general health, facilities and the application of science to sport. The rate of improvement has slowed down since about 1960, which is probably due to diminishing returns as athletes and coaches get better.
Men's performances have generally improved in the range of 20-30%. Women's performances have generally improved in the range of 5-20%. However, this difference stems largely from the fact that women have only been able to compete at all distances for a relatively short time, mostly after 1960, which is when the rate of improvement has generally been slower in all events. When there is a comparable data history for both men and women in the same event (e.g., half marathon) then the improvements in both the men's and women's events are similar.
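The spread of the top performances can be expressed as z-scores, i.e. how many standard deviations each top performer's characteristic lies from the mean for that event. This is a minimal sketch under stated assumptions: the function name `top_performer_zscores`, the column names and the data are all hypothetical, lower times are treated as better, and the population standard deviation is used.

```python
import pandas as pd


def top_performer_zscores(df, characteristic, result_col="Time_s", top_n=3):
    """Z-score of `characteristic` for the `top_n` fastest results.

    Assumes a lower value in `result_col` is a better performance.
    """
    mean = df[characteristic].mean()
    std = df[characteristic].std(ddof=0)  # population standard deviation
    z = (df[characteristic] - mean) / std
    top = df.nsmallest(top_n, result_col)
    return z.loc[top.index]


# Hypothetical athletes in one event and period; values are invented.
event_results = pd.DataFrame({
    "Height": [170.0, 175.0, 180.0, 185.0, 190.0],  # cm
    "Time_s": [10.0, 10.4, 10.2, 10.5, 10.1],
})

z_top = top_performer_zscores(event_results, "Height", top_n=2)
print(z_top)
```

In this made-up example the two fastest athletes are the shortest and the tallest, so their z-scores have opposite signs, which is the kind of spread described above.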
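For timed events, a percentage improvement of this kind can be computed as the reduction in time relative to the earlier time. The sketch below assumes that definition, which may differ from the one used for the figures above, and the function name and times are made up for illustration.

```python
def pct_improvement(earlier_time_s, later_time_s):
    """Percentage reduction in time from an earlier result to a later one.

    Assumes a lower time is better, so a positive value means improvement.
    """
    return (earlier_time_s - later_time_s) / earlier_time_s * 100.0


# Hypothetical times in seconds, not taken from the data set.
improvement = pct_improvement(12.0, 9.6)
print(f"{improvement:.1f}%")
```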
It's clear that athlete characteristics are important in determining which event an athlete will compete in: an athlete with a particular combination of height, weight, age and BMI (by implication) will be suited to particular events and less suited to others. This shows there is a relationship between athlete characteristics and performance.
This analysis also looked at single events in isolation and looked at the athletes competing in that event. It showed that although the characteristics of athletes for a particular event cluster about a mean, and the combination is characteristic of that event, athletes with a wide range of individual height, weight, BMI and age still produce top performances. The top performances do not cluster very closely to the mean - some top performances fall 1-2 standard deviations away from the mean.
The lessons for the aspiring athlete are:
Many factors influence athletes' performance, and this study has looked at just a few of them. A more extensive study would examine the effect of factors such as training schedules (e.g., distance run per week, techniques such as weight training), diet, and athlete characteristics such as VO2 max.
This analysis is published on GitHub (https://github.com/mattjezza/ds-proj1-t2-elite-athletics) and summarised in a post on Medium.